{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 第八章 处理时间序列\n",
    "\n",
    "本章依赖的类库有：\n",
    "\n",
    "- numpy 快速操作结构数组的工具\n",
    "- scipy 科学计算库\n",
    "\n",
    "python类库安装教程见实体书附录A。\n",
    "____"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8.1 Embedding"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 简单的文本识别\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import re\n",
    "import numpy as np\n",
    "\n",
    "def process_text(text):\n",
    "    \"\"\"将标点符号替换成空格\"\"\"\n",
    "    dot_word = r'。|，|\\.|\\?|,|!|，|\\(|\\)|\\(|\\)| '\n",
    "    return re.sub(dot_word, ' ', text)\n",
    "\n",
    "def text2vec(text):\n",
    "    \"\"\"将文本转换成向量\"\"\"\n",
    "    cleaned_text = process_text(text) # 输入文本预处理\n",
    "    text_vec = cleaned_text.split()\n",
    "    vocab_list = list(set(text_vec)) # 词汇集\n",
    "    numer_vec = [0.]*len(vocab_list) # 数字向量\n",
    "    for word in text_vec:\n",
    "        # 每遇到一个单词，对应的数字向量位置+1\n",
    "        numer_vec[vocab_list.index(word)] += 1\n",
    "    return numer_vec / np.sum(numer_vec) # 归一化，让向量中所有数值加起来等于1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0.03508772,  0.01754386,  0.01754386,  0.01754386,  0.01754386,\n",
       "        0.01754386,  0.01754386,  0.01754386,  0.01754386,  0.01754386,\n",
       "        0.01754386,  0.01754386,  0.01754386,  0.01754386,  0.01754386,\n",
       "        0.01754386,  0.01754386,  0.01754386,  0.01754386,  0.01754386,\n",
       "        0.01754386,  0.01754386,  0.01754386,  0.01754386,  0.01754386,\n",
       "        0.01754386,  0.01754386,  0.01754386,  0.01754386,  0.01754386,\n",
       "        0.01754386,  0.07017544,  0.0877193 ,  0.01754386,  0.01754386,\n",
       "        0.03508772,  0.03508772,  0.05263158,  0.01754386,  0.01754386,\n",
       "        0.01754386,  0.01754386,  0.01754386,  0.01754386,  0.01754386])"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text = 'Without a doubt, a loving and friendly puppy or dog can put an instant smile on your face! When you adopt a dog from Atlanta Humane Society, you gain a wonderful canine companion. But most of all, when you adopt a rescue dog, you have the ability to bond with one of Atlanta’s forgotten and neglected animals.'\n",
    "\n",
    "text2vec(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 深度学习从读懂词义开始\n",
    "\n",
    "欧氏距离和余弦向量："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from scipy import spatial\n",
    "\n",
    "def euc_distance(v1, v2):\n",
    "    \"\"\"用欧氏距离判断相似距离\"\"\"\n",
    "    return np.linalg.norm(np.array(v1) - np.array(v2))\n",
    "\n",
    "def cos_similar(v1, v2):\n",
    "    \"\"\"用余弦向量判断相似程度\"\"\"\n",
    "    return 1 - spatial.distance.cosine(np.array(v1), np.array(v2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "计算One-Hot编码的单词向量相似度"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.41421356237\n",
      "1.41421356237\n",
      "0.0\n",
      "0.0\n"
     ]
    }
   ],
   "source": [
    "puppy_vec = [1.0, 0.0] + [0.0] * 998\n",
    "dog_vec = [0.0, 1.0] + [0.0] * 998\n",
    "some_word_vec = [0.0] * 499 + [1.0] + [0.0] * 500\n",
    "\n",
    "print(euc_distance(puppy_vec, dog_vec))\n",
    "print(euc_distance(puppy_vec, some_word_vec))\n",
    "print(cos_similar(puppy_vec, dog_vec))\n",
    "print(cos_similar(puppy_vec, some_word_vec))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8.2 适合序列的模型"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### LSTM（Long Shore-Term Memory长短期记忆）"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在Keras中，你可以按照下面的方式给模型增加一个LSTM层："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "mode.add(LSTM(32)) # 输出向量长度32"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在Caffe的prototxt文件中，以下面的方式配置一个LSTM层：\n",
    "\n",
    "    layer {\n",
    "      name: \"lstm\" # lstm层\n",
    "      type: \"Lstm\"\n",
    "      bottom: \"data\"\n",
    "      bottom: \"xxx\"\n",
    "      top: \"lstm\"\n",
    "\n",
    "      lstm_param {\n",
    "        num_output: 32 # lstm输出向量维度\n",
    "        clipping_threshold: 0.1 # 为了解决梯度爆炸，设置的梯度最大阈值\n",
    "        weight_filler {\n",
    "          type: \"gaussian\" # 初始化weight的方式\n",
    "          std: 0.1\n",
    "        }\n",
    "        bias_filler {\n",
    "          type: \"constant\" # 初始化bias的方式\n",
    "        }\n",
    "      }\n",
    "    }"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
